CLigOpt: controllable ligand design through target-specific optimization

Abstract Motivation A key challenge in deep generative models for molecular design is to navigate random sampling of the vast molecular space, and produce promising molecules that strike a balance across multiple chemical criteria. Fragment-based drug design (FBDD), using fragments as starting points, is an effective way to constrain chemical space and improve generation of biologically active molecules. Furthermore, optimization approaches are often implemented with generative models to search through chemical space, and identify promising samples which satisfy specific properties. Controllable FBDD has promising potential in efficient target-specific ligand design. Results We propose a controllable FBDD model, CLigOpt, which can generate molecules with desired properties from a given fragment pair. CLigOpt is a variational autoencoder-based model which utilizes co-embeddings of node and edge features to fully mine information from molecular graphs, as well as a multi-objective Controllable Generation Module to generate molecules under property controls. CLigOpt achieves consistently strong performance in generating structurally and chemically valid molecules, as evaluated across six metrics. Applicability is illustrated through ligand candidates for hDHFR and it is shown that the proportion of feasible active molecules from the generated set is increased by 10%. Molecular docking and synthesizability prediction tasks are conducted to prioritize generated molecules to derive potential lead compounds. Availability and implementation The source code is available via https://github.com/yutongLi1997/CLigOpt-Controllable-Ligand-Design-through-Target-Specific-Optimisation.


Introduction
The development of novel drugs is dependent on the efficient discovery of novel molecules (Bilodeau et al. 2022b).A critical challenge in this process is the difficulty in navigating the immense molecular space.Traditional methods, such as highthroughput screening, are inefficient as they require large resources and only result in a small number of hits (Zeng et al. 2022).Attention has been drawn to powerful generative deep learning techniques for efficient drug design due to their ability to learn a distribution for a given dataset and produce novel synthetic data from it.Deep generative models can explore chemical space efficiently through formulating a mapping between desired properties and corresponding molecular structures (Bilodeau et al. 2022a).Attributed to the success of syntactic pattern recognition in language modelling tasks, early molecular generative models are designed on the basis of SMILES, a line notation that describes chemical structures using ASCII strings.However, SMILES representation is limited by its intrinsic instability in describing molecular structural information, as structurally similar molecules can result in dissimilar SMILES strings (Pang et al. 2024).
Graph-based generative models are considered particularly suitable for molecular generation, as molecular structures are naturally well-represented by graphs (Cheng et al. 2021).With success in applications of generative models and graph neural networks (GNNs), many popular generative architectures such as variational autoencoder (VAE) (Liu et al. 2018), flow-based models (Jin et al. 2023), are adapted to graph data structures and applied on molecule generation tasks.A common generation strategy starts with a latent representation with no atoms and bonds, building molecular graphs through autoregressive or one-shot generation.However, such generation strategy utilizes random sampling to introduce novelty in the generated molecular structures.As molecule generation is extremely strict with precise validity and biological constraints, the basic graph-based molecular generation is unreliable and requires further control, which leads to the adoption of chemistry-aware constrained generation steps (e.g.Liu et al. 2018).
Fragment-based drug design (FBDD) approaches, which generate molecules conditioned on pre-existing sub-molecular units, constitute effective means to further constrain chemical space and improve generation of biologically active compounds (Mukaidaisi et al. 2022).Compared to random sampling schemes widely applied in exploring novel structures, using chemically meaningful fragments in molecular generation keeps search space chemically relevant and reduces the number of generation steps needed for frequent molecular motifs (Maziarz et al. 2021).Moreover, FBDD strategies can utilize information obtained from structureactivity relationship analysis, and can preserve such useful biological activity properties in the synthesized molecules.FBDD strategies have been introduced and successfully applied in fragment-to-lead studies via different techniques, such as scaffold hopping (Imrie et al. 2021), linker/PROTAC design (Imrie et al. 2020), and R-group optimization (Jin et al. 2023).
Optimization approaches are often integrated with generative models to navigate through the search space and identify promising initial seeds which can generate molecules with specified properties.Such optimization is usually implemented through heuristic optimization techniques, such as, Bayesian optimization (BO) (G� omez-Bombarelli et al. 2018), or reinforcement learning (RL) (Popova et al. 2018).However, these methods require additional computation while balancing exploration and exploitation within the explored space.For instance, BO may struggle with highdimensional search spaces as the number of features increases and the search space becomes increasingly sparse, thereby requiring more time and training data to find promising regions to sample.RL is limited in multi-objective optimization, as it requires large volume of data and involves a lot of computation.Whereas for molecular generation problem, the more conditions added to optimization objective, the less suitable molecules are for policy training.Therefore, controlling multiple attributes in molecular design efficiently remains challenging.
To generate molecules that preserve chemically meaningful fragments and combine multiple attributes relating to different properties, we propose a controllable graph-based deep generative model for fragment linking, CLigOpt, towards target-specific ligand generation.CLigOpt takes two substructures with desired chemical functions and generates a linker that connect them, resulting in a valid molecule.CLigOpt incorporates a Controllable Generation Module (CGM), which evaluates an initial sample with respect to its chemical properties before generation, so that linkers satisfy multiple chemical property requirements.Our model exhibits consistent and strong performance across six evaluation metrics in both structural feasibility assessment and chemical usefulness.A case study on designing inhibitors for hDHFR is used to demonstrate applicability, and the quality of generated inhibitors is assessed via molecular docking simulations and retrosynthesis reaction estimation.The top prioritized molecules are provided as potential leads for hDHFR inhibition, together with the predicted retrosynthesis reactions.
The main contributions of our work are as follows: (i) a VAE-based model for linker design, where co-embeddings of nodes and edges are utilized to improve the information learned from molecular graphs, is developed and evaluated, (ii) a multi-objective controllable molecular generation module, CGM, is introduced to generate promising target-specific ligands, (iii) comparisons with state-of-the-art methods and multiple metrics are reported to evaluate results of the generative process.

Data
The processed and filtered ZINC (Sterling and Irwin 2015) and CASF (Su et al. 2018) datasets were used as previously described (Imrie et al. 2020) for general-purpose model training and testing (41 74 997 examples from ZINC for training, 400 examples from ZINC and 309 examples from CASF for testing).Each example consisted of two fragments and a linker with size between 3 and 12 atoms.For target-specific linker design, 814 inhibitors and their binding affinities (pIC50) were obtained from ChEMBL (Gaulton et al. 2017), targeting human dihydrofolate reductase (hDHFR).The inhibitors were processed and fragmented following the procedure described in Hussain and Rea (2010), resulting in 1576 fragment pairs, 80% of which were used for training affinity predictors, and 20% selected for target-specific controllable generation.

Model framework
The proposed CLigOpt methodology is a graph-based VAE framework which takes two molecular fragments as input and generates a linker to form a complete plausible molecule.CLigOpt contains two components, a generative network θ that maps from a latent representation z to a sample x, providing a joint distribution oN where p(z) is a smooth distribution, p θ ðxjzÞ is the decoder that maps from z to x.An encoder ϕ maps from x to z, which is approximated by q ϕ ðzjxÞ to reduce the computational cost.
Our training and generative procedures are outlined in Fig. 1 and Algorithms S1 and S2.

Encoder
The workflow of encoder is shown in Fig. 1A.Two models [GCN (Kipf and Welling 2016) and CensNet (Jiang et al. 2023)] are employed as encoders and their ability to learn representations from hidden graph embeddings is investigated.Specifically, GCN is one of the most commonly used GNN architectures and its convolutions can be seen as message-passing through the node adjacency matrix, using only edge information to determine which nodes are adjacent to update node features.CensNet defines two types of layers, i.e. node and edge layer, which employ convolutions on line graphs, enabling the roles of nodes and edges to switch.Through convolution operations, co-embeddings of nodes and edges can be learned.Owing to this alternative updating rule, CensNet allows embedding nodes and edges to latent feature space simultaneously, as well as mutual enrichment over node and edge information.
We developed two version of CLigOpt models based on the two aforementioned networks.CLigOptGCN used three layers of GCNs as encoder, while the CLigOptCensNet encoder contained two node CensNet layers and one edge CensNet layer (see Fig. 1A).The means and variances, denoted as ðμ f ; σ f Þ and ðμ m ; σ m Þ, of fragment G F and molecule G M are obtained by the encoder, which can be formulated as: Then, the initial graph representation for decoder, z, is the sum of μ f and the combination of z m sampled from either the distribution N m ðμ m ; σ m Þ (for training) or N(0, 1) (for generation), and a latent node representation τ derived from z m through a linear layer where ½; � represents the operation of concatenation.

Decoder
Our training and generative procedures are explained in Algorithms S1 and S2.The decoder utilizes a similar sequential generation scheme inspired by CGVAE (Liu et al. 2018) where GNNs are applied to predict edge connection through a bond-by-bond manner (see Fig. 1B).We utilize the same graph embedding network engaged in Encoder (GCN and CensNet) to learn an informative representation of the generated graph.Networks are improved by adding layer normalization and dropout unit for each layer to reduce the impact of internal co-variate shift and prevent overfitting.
During generation, a set of nodes is initialized between the two fragments from which a linker is built through an iterative node update, edge selection and edge labelling process.At the beginning of each iteration, a focus node u is first selected for the decoder to predict its connectivity to the graph, including a pseudo stop node, subject to the valency constraints.The initial graph representation z is first be updated via a graph encoder layer.The graph representation z t at the tth step (iteration) is represented by, Then, for each remaining node v, a feature vector is constructed to predict the probability of an edge existing between u and v and the type of this edge.The feature vector integrates node-level information on the focus node and its edge candidates, as well as graph-level information for the initial graph representation and the current graph states.The feature vector for predicting the probability of an edge between u and v is provided by: ii64 Li et al.
where s t u and s t v are the concatenation of hidden state z t and atomic label representations of the focus node u and the remaining nodes v at step t, d v;u is the graph distance between v and u, H 0 is the average initial graph representation, H t is the average representation of the current graph state, D represents the 3D structural information of the two input fragments, and the vectors inside square brackets are concatenated.
At the end of each step, the edge with the highest-predicted probability is selected to connect to the focus node.The probability of an edge connecting to the stop node is also predicted through the same procedure.When the stop node is selected, the current focus node is finished and the nodes it connects then become the next nodes to focus.
The reconstructed molecule ĜM is then formed by xM and Adj t n f .

Training
Our model is trained with a standard VAE objective.For a given pair of fragments G F and a molecule G M , the model is trained to reconstruct G M from ðG F ; zÞ using teacher forcing (Williams and Zipser 1989), while enforcing the regularization constraint on the encodings of G F and G M .Thus, the objective function comprises two components, a reconstruction loss and a regularization loss: Lðθ; ϕÞ ¼ L recon ðθ; ϕÞ þ L KL ðϕÞ (8)

Target-specific controllable generation
We built a CGM referring to CLaSS (Das et al. 2021) that optimizes molecule generation for desirable properties.Given an initial sample z, and a set of independent attributes a 2 f0; 1g n ¼ fa 1 ; a 2 ; . . .; a n g, CLaSS trains a set of classifiers that predicts the probability of z getting accepted on each attribute a i .The process of conditional generation of a molecule x given attributes the independent attributes, pðxjaÞ, can then be formulated as: where pξ ðzjaÞ is derived using a density model Q ξ ðzÞ, in combination with a per-attribute classifier model q ξ ða i jzÞ, based on Bayes' rule and conditional independence of the attributes, leveraging rejection sampling from parametric approximations to pðzjaÞ.By enforcing multiple attribute constraints, the acceptance probability is the product of the scores from the attribute predictors, while sampling from the explicit density Q ξ ðzÞ.

Controllable generation module
We adapt CLaSS as our CGM aiming to generate molecules with high drug-likeness, high synthetic accessibility (SA) and high binding affinity to a target protein.A schematic representation of our module is illustrated in Fig. 1C.A Gaussian mixture density model is fitted over known molecules to estimate the posterior distribution of latent space z.From there, we design three classifiers in parallel, which take in a given latent embedding z and a pair of fragments encoded by CLigOpt, to evaluate the quantitative estimate of druglikeness (QED), SA score, and the negative logarithm of IC50 (pIC50) respectively, aligning with the aforementioned three standards.Rejection sampling then operates through sampling from the fitted density model.CGM evaluates whether an initial sample is likely to satisfy certain standards on selected attributes through the previously mentioned classifiers, and thus decides whether to proceed with generation with this seed or not.

Results
The performance of CLigOpt was measured in different aspects of molecular generation, i.e. (i) fragment linking and forming plausible molecules, assessed through benchmark linker design (benchmark evaluation on test sets) and (ii) generating target-specific drug-like molecules (controllable target-specific ligand generation), measured through a case study of generating lead candidates for hDHFR.Generated molecules were then prioritized and binding potential of the generated molecule set was evaluated via docking (evaluation via molecular docking).Finally, synthesizability via specific reaction steps of promising molecules are indicated (synthesizability of generated molecules).

Benchmark evaluation on test sets
The ability of CLigOpt in generating reasonable molecules was evaluated on two benchmark datasets, namely ZINC and CASF, and compared with two baseline models for fragment linking, namely DeLinker (Imrie et al. 2020) and FFLOM (Jin et al. 2023).DeLinker is a VAE-based model that generates molecules in bond-by-bond manner, whereas FFLOM is a flow-based model that carries out one-shot generation of molecules.For each test set, 250 molecules were generated for each fragment pair, the quality of generated molecules was assessed on the basis of (i) structural usefulness via validity, novelty and uniqueness metrics, and (ii) chemical usefulness, i.e.SA score, ring aromaticity, and Pan-Assay Interference compounds (PAINS).The results are shown in Fig. 2 and Supplementary Table S1.This indicates that DeLinker is able to generate molecules that are synthesizable and have good potential as drug candidates, but the generated set is repetitive and underperforms in novelty.On other hand, FFLOM can generate highly novel and unique molecules, however mostly difficult to synthesize.As a trade-off between these two cases, CLigOpt demonstrates competitive performance across all evaluation aspects.

Controllable target-specific ligand generation
A potential anticancer agent, hDHFR, is employed as target for a case study to investigate the ability of CGM to generate molecules with desired properties.QED, SA, and pIC50 were used as the optimization goals, aiming to generate molecular structures that fulfil pre-set conditions for three properties, (i) druglikeness, (ii) synthesizability, and (iii) binding activity to hDHFR.The three conditions correspond to three classifiers predicting from an initial sampling of the latent space, to evaluate whether CLigOpt can generate a molecule with QED > 0.6, SA < 3, and pIC50 > 6.The classifiers can be used together or independently to generate molecules with different property control.We take the test set of hDHFR inhibitors described in Section 'Data', and generate 10 molecules for each pair of fragments for random sampling and rejection sampling.

Ground truth property estimation
To evaluate the ability of CLigOpt in filtering out samples that did not meet the pre-set conditions, tools are employed to compute the true properties for the generated molecules as a guide.Due to concerns of introducing biases from overfitting, we utilize two pre-defined algorithms from rdkit (Bento et al. 2020) instead of neural networks, to compute the QED and SA score.Three machine learning approaches with diverse architectures were explored to predict pIC50: (i) transformer-CNN (Karpov et al. 2020), a SMILES-based Quantitative Structure-Activity Relationship (QSAR) model that learns molecular embeddings from a transformer encoder, and predicts the corresponding activity using CharNN, (ii) modSAR (Li et al. 2024, Cardoso-Silva et al., 2019a, 2019b), a two-stage QSAR model that involves detecting clusters of molecules, and applying mathematical optimization-based piecewise linear regression to link molecular structures to their biological activities, and (iii) DeepAffinity (Karimi et al. 2019), a semi-supervised deep learning model that utilizes a unified RNN and CNN model to predict binding affinity.We compare transformer-CNN and modSAR that are trained on a subset of hDHFR inhibitors, and DeepAffinity with the provided pre-trained parameters to select the most accurate model for predicting binding affinity.
Comparisons of the three models via the RMSE metric is shown in Supplementary Table S5.Transformer-CNN outperforms DeepAffinity and modSAR with RMSE ¼ 0.82, and is selected to predict pIC50 for the generated molecules as a reference to evaluate the performance of CLaSS module.The results of the normalized fraction of molecules that meet the pre-set conditions, for a specific property as well as balancing all three properties, in a random generated set compared with ii66 Li et al.
CLaSS-generated set are shown in Table 1.Evaluation on how CLaSS performs in a single property with different thresholds is shown in Supplementary Tables S2-S4.

Controllable generation
Results show that CGM excels in efficiently generating more useful molecules.Table 1 and Supplementary Tables S2-S4 report higher percentage of molecules meeting the pre-set conditions in CLaSS-generated set compared to the randomly sampled set, indicating that CLigOpt is able to identify the latent embeddings that result in molecules with desired properties.Supplementary Tables S2-S4 illustrate that, for QED and SA control, CLigOpt can generate a molecule set that contains three times more qualified molecules than random sampling.CLigOpt is less effective in pIC50 control, yet still better than random sampling.CLigOpt is also capable of multi-objective control, generating molecules that satisfy requirements for different goals.Moreover, the accepted set maintains a certain level of similarity to the original set in terms of QED and LogP (see Supplementary Fig. S2).

Evaluation via molecular docking
AutoDock docking (Trott and Olson 2010) was employed, as developed by DeepChem (Ramsundar et al. 2019), to further evaluate the binding of CLigOpt generated molecules.Molecular docking is conducted with the 3D structure of hDHFR on the CLaSS-generated molecules with 10 highestpredicted pIC50.We compare the binding free energy (BFE) of the generated molecules with 10 hDHFR inhibitors with highest binding affinities.Table 2 shows the results of average (E) and minimum (Min) BFE of these molecules, with respect to their binding affinities.Results indicate that CLigOpt is able to generate promising molecules that bind the druggable pocket of the 3D target structure with similar BFEs compared to known hDHFR inhibitors.

Synthesizability of generated molecules
The number of reaction steps required to complete the synthesis of a given molecule provides an indication of its complexity with respect to commercially available materials.We randomly selected 100 molecules from CLigOptGCN and CLigOptCensNet generated set obtained in controllable target-specific ligand generation, as well as the ChEMBL set as a baseline, to compute and compare the percentages of synthesizable molecules within each set.The retrosynthetic routes for each molecule were assessed using a molecular transformer-based model (Schwaller et al. 2019a(Schwaller et al. , 2019b)), and it was observed that the generated molecules have higher successful retrosynthesis prediction than the ChEMBL set, with success rate of 84% molecules from the ChEMBL set, 96% from both generated sets.The distribution of the number of steps needed for each set is shown in Fig. 3. Overall, compared to the baseline set, CLigOpt generated molecules exhibit promising results in synthesizability regarding retrosynthetic analysis.Most molecules can be synthesized within four reaction steps.CLigOptCensNet-generated molecules show better synthesizability, managing most retrosynthetic routes in 1 or 3 steps.ChEMBL and CLigOptGCN generated molecules show similar results in the amount of molecules that can be made within two steps, both achieving a peak at three steps.In comparison, more CLigOptGCN generated molecules require an additional step to synthesize than the ChEMBL molecules.We prioritize molecules from the two generated sets according to their QED, SA score, and pIC50, and provide the predicted retrosynthesis route for the best prioritized molecule in CLigOptCensNet in Fig. 4.

Conclusion
In this work, we introduce, CLigOpt, a pipeline for efficient optimized molecule generation that contains a CGM, in which the generative process is restricted via rejection sampling.CLigOpt shows advantages over existing optimization approaches in terms of reduced computational complexity, as it does not require heavy pre-training, surrogate model fitting, or policy learning.From comparing the quality of CLigOpt generated molecules with and without the property control module, we show that CLigOpt is an effective yet simple sampling scheme that improves the generation performance significantly and can handle multi-objective criteria well.
The evaluation of results is based on a variety of metrics that reflect the chemical and structural properties of the generated molecules.CLigOpt obtains consistent and strong performance  (2019b).Random samples of 100 molecules from each of our models and from 100 random molecules drawn from ChEMBL were used.
CLigOpt: controllable target-specific ligand design across all six metrics.Performance of CLigOpt is evaluated with two encoders, GCN and CensNet.CLigOptCensNet shows slightly better performance than CLigOptGCN in novelty, uniqueness, and ring aromaticity, suggesting that employing co-embedding for nodes and edges provides more information than sole node embedding, and can also improve performance for the ensuing generative model.The applicability of CLigOpt is illustrated via a case study on generating hDHFR ligands, evaluated through binding free energy simulation and retrosynthesis prediction.Generated molecules show similar binding compared to known hDHFR inhibitors, indicating the potential of CLigOpt to generate promising lead candidates.Future work can focus on a set of systematic optimization objectives to further restrict and optimize searching molecular space.

Figure 1 .
Figure 1.A diagrammatic representation of CLigOptCensNet.(A) An overview of the CensNet encoder, two fragments and a complete molecule are encoded into latent space, from where an initial representation for decoder is formed through the weighted sum of the average fragment representation and a sample is how the model Encoder encodes two molecules into the latent space, which is then decoded by our Decoder.(B) An overview of CensNet decoder, where the initial representation is applied through a sequential edge prediction process joining the two fragments.(C) The controllable generation module (CGM) where our trained encoder is used to encode molecules into the latent space to fit an explicit density model and attribute classifiers, and then sample from the fitted density model and attribute classifiers before decoding each point into the final molecule.

Figure 2 .
Figure 2. A radar plot showing the performance of our model as well as the two baselines.Although our model does not outperform the best model in each category, CLigOptCensNet achieves the best average performance.
Such process is repeated until every remaining node is visited.The largest connected component is returned as the predicted molecule.After predicting all edges, the node symbols xM is generated by a node prediction module MLP N xM ¼ MLP N ðzÞ

Table 1 .
The proportion of molecules that pass certain filters when sampling randomly versus when sampling using the CLaSS accepted set.

Table 2 .
Results summarizing molecular docking simulation.Figure 3. A histogram showing the proportion of molecules that could have a retrosynthetic path predicted by the model in Schwaller et al.